Graphics Processing Units (GPUs) are being used in many areas of
physics, since their performance-to-cost ratio is very attractive. GPUs
can be programmed with CUDA, NVIDIA's parallel computing architecture,
which enables dramatic increases in computing performance by
harnessing the power of the GPU. We present a performance comparison
between the GPU and the CPU, in single and double precision, in
generating lattice SU(2) configurations. Analyses with single and multiple
GPUs, using CUDA and OpenMP, are also presented. We also present
SU(2) results for the renormalized Polyakov loop, the colour averaged free
energy and the string tension as a function of the temperature.

PACS numbers: 11.15.Ha; 12.38.Gc

1. Introduction 

Since the first release of CUDA (Compute Unified Device Architecture)
by NVIDIA, GPUs (Graphics Processing Units) have been used
for physics computing in many areas where performance is relevant.
CUDA gives developers access to the virtual instruction set and
memory of the GPU's computational elements. Whereas the CPU was designed to
execute a single thread very quickly, the GPU architecture was designed
to execute many concurrent threads, each comparatively slowly.

The most successful theories that describe elementary particle physics 
are the so-called gauge theories. SU(2) is an interesting gauge group, either
to simulate the electroweak theory, or to use as a simplified case of the SU(3) 
gauge group of the strong interaction. 

However, generating SU(N) lattice configurations is a highly computa- 
tionally demanding task and requires advanced computer architectures such 
as CPU clusters or GPUs. 
Nevertheless, GPUs are easier to access and maintain than CPU clusters,
as they can run on a local desktop computer.

This paper is divided into three sections. In section 2, we present the
performance results and the results for the Polyakov loop, the colour averaged
free energy and the string tension, together with a brief description of how to
calculate them. For a more detailed description of how to generate lattice SU(2)
configurations on GPUs, see [1]. In section 3, we conclude.

2. Results 

We implemented our code in CUDA to run on one GPU or on
several GPUs with OpenMP. The code was tested on two different
architectures, the NVIDIA GeForce GTX 295 and GTX 480 cards; see Table 1.



                               NVIDIA GeForce GTX
                               295 (GT200)        480 (Fermi)
Number of GPUs                 2                  1
CUDA Capability                1.3                2.0
Number of cores                2 x 240            480
Global memory                  896 MB per GPU     1536 MB
Number of threads per block    512                1024
Registers per block            16384              32768
Shared memory (per SM)         16 KB              48 KB or 16 KB
L1 cache (per SM)              None               16 KB or 48 KB
L2 cache (per SM)              None               768 KB
Clock rate                     1.37 GHz           1.40 GHz

Table 1: NVIDIA architecture specifications (SM means Streaming
Multiprocessor).



2.1. Performance 

In order to test the GPU performance, we measure the execution time of
the CUDA implementation on one and two GPUs and of the serial code on one
CPU core (Intel Core i7 920, 2.67 GHz, 8 MB of L2 cache
and 12 GB of RAM) for different lattice sizes at $\beta = 6.0$, with random SU(2)
matrix initialization followed by 100 iterations of the heat bath method and
the calculation of the mean plaquette at each iteration; see Fig. 1.
For a more detailed overview see [1].
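
As an illustration of the observable measured at each iteration, the following is a minimal serial sketch of the mean plaquette. It is not the CUDA implementation of [1]. The quaternion storage of SU(2) links, the normalization (1/2) Tr and all function names (`mul`, `make_links`, `mean_plaquette`, ...) are assumptions made for this example.

```python
import itertools
import math
import random

# Assumption: an SU(2) matrix a0*1 + i*(a1*s1 + a2*s2 + a3*s3) is stored as
# the unit quaternion (a0, a1, a2, a3), so that (1/2) Tr U = a0.
IDENTITY = (1.0, 0.0, 0.0, 0.0)


def mul(p, q):
    """Product of two SU(2) matrices in quaternion form."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    return (p0 * q0 - p1 * q1 - p2 * q2 - p3 * q3,
            p0 * q1 + p1 * q0 - p2 * q3 + p3 * q2,
            p0 * q2 + p2 * q0 - p3 * q1 + p1 * q3,
            p0 * q3 + p3 * q0 - p1 * q2 + p2 * q1)


def dagger(p):
    """Hermitian conjugate: flip the sign of the vector part."""
    return (p[0], -p[1], -p[2], -p[3])


def random_su2(rng):
    """Random SU(2) element from a normalized Gaussian 4-vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def site_index(x, dims):
    """Lexicographic site index with periodic boundary conditions."""
    i = 0
    for mu in reversed(range(4)):
        i = i * dims[mu] + (x[mu] % dims[mu])
    return i


def neighbour(x, mu):
    """Coordinates of the site one step forward in direction mu."""
    y = list(x)
    y[mu] += 1
    return y


def make_links(dims, rng=None):
    """links[site][mu]: identity links if rng is None, random links otherwise."""
    nsites = math.prod(dims)
    if rng is None:
        return [[IDENTITY] * 4 for _ in range(nsites)]
    return [[random_su2(rng) for _ in range(4)] for _ in range(nsites)]


def mean_plaquette(links, dims):
    """Average of (1/2) Tr over all plaquettes of the lattice."""
    total, count = 0.0, 0
    for x in itertools.product(*(range(d) for d in dims)):
        for mu in range(4):
            for nu in range(mu + 1, 4):
                u1 = links[site_index(x, dims)][mu]
                u2 = links[site_index(neighbour(x, mu), dims)][nu]
                u3 = dagger(links[site_index(neighbour(x, nu), dims)][mu])
                u4 = dagger(links[site_index(x, dims)][nu])
                total += mul(mul(u1, u2), mul(u3, u4))[0]
                count += 1
    return total / count


if __name__ == "__main__":
    dims = (4, 4, 4, 4)
    print("identity links:", mean_plaquette(make_links(dims), dims))
    print("random links:  ", mean_plaquette(make_links(dims, random.Random(1)), dims))
```

With identity links every plaquette equals one, while a random start gives a mean close to zero; the heat bath iterations, not shown here, then drive the value towards its equilibrium at the chosen $\beta$.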

(a) Single precision. (b) Double precision.

Figure 1: Performance results. 295: NVIDIA GeForce GTX 295; 480:
NVIDIA GeForce GTX 480; (1): with 1 GPU; (2): with 2 GPUs; Tex:
using textures; GM: using global memory.

2.2. Finite Temperature

The Polyakov loop, $\langle L \rangle$, is an order parameter for the deconfinement
transition [2]. It measures the free energy, $F_q$, of a single static quark at
temperature $T$,

$$\langle L \rangle \propto \exp\left(-F_q/T\right) \qquad (1)$$

where $T$ is related to the lattice spacing $a$ and to the number of lattice
sites in the time direction, $N_t$, by $T = 1/(a N_t)$.
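
The definition above can be made concrete with a short sketch that multiplies the time-direction links along each temporal line and averages the trace over the spatial volume. The quaternion storage, the normalization (1/2) Tr and the names `polyakov_loop` and `temperature` are assumptions of this example, not details taken from the source.

```python
# Assumption: an SU(2) matrix a0*1 + i*(a1*s1 + a2*s2 + a3*s3) is stored as
# the unit quaternion (a0, a1, a2, a3), so that (1/2) Tr U = a0.
IDENTITY = (1.0, 0.0, 0.0, 0.0)


def mul(p, q):
    """Product of two SU(2) matrices in quaternion form."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    return (p0 * q0 - p1 * q1 - p2 * q2 - p3 * q3,
            p0 * q1 + p1 * q0 - p2 * q3 + p3 * q2,
            p0 * q2 + p2 * q0 - p3 * q1 + p1 * q3,
            p0 * q3 + p3 * q0 - p1 * q2 + p2 * q1)


def polyakov_loop(temporal_links):
    """Spatial average of (1/2) Tr of the product of time-direction links.

    temporal_links[s][t] is the time-direction link at spatial site s and
    time slice t; each inner list has N_t entries.
    """
    total = 0.0
    for line in temporal_links:
        product = IDENTITY
        for u in line:
            product = mul(product, u)
        total += product[0]
    return total / len(temporal_links)


def temperature(a, nt):
    """T = 1 / (a * N_t), in the inverse of the units used for a."""
    return 1.0 / (a * nt)


if __name__ == "__main__":
    nt, nspatial = 4, 8
    cold = [[IDENTITY] * nt for _ in range(nspatial)]
    print(polyakov_loop(cold), temperature(0.5, nt))
```

Multiplying every link of a single time slice by the centre element $-1$ flips the sign of the loop, which is the symmetry whose breaking signals deconfinement.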

The results for the Polyakov loop, Fig. 2a, show a dependence on the
temporal extent of the lattice. This is due to the self-energy
contribution of the static quark source used as the order parameter.

Eliminating this self-energy term is necessary to obtain an order
parameter that is a function of the temperature alone.

This can be done using the renormalization procedure described in [3],
with the values obtained in [4] for the effective potential as seed
values. The renormalized Polyakov loop can be written as

$$\langle L^{R} \rangle = \left(Z(g^2)\right)^{N_t} \langle L \rangle \qquad (2)$$

where the renormalization constants $Z(g^2)$ should depend only on the bare
coupling. Fitting the values of $Z(g^2)$ obtained with this procedure to
$Z(g^2) = \exp\left(A g^2 + B g^4\right)$, we obtain $A = 0.0637(18)$ and $B = 0.0731(16)$,
with $\chi^2/\mathrm{dof} = 1.16613$ for $g^2 < 1.3$. Applying these values to all of our
results in Fig. 2a, we obtain a renormalized Polyakov loop, Fig. 2b, which
is independent of the temporal extent of the lattice. At high
temperatures, the renormalized Polyakov loop approaches the corresponding
HTL result.

Figure 2: Polyakov loop. (a) Unrenormalized SU(2) Polyakov loop, $\langle L \rangle$, at
finite temperature. (b) Renormalized SU(2) Polyakov loop; the dotted lines
correspond to the pure gauge Polyakov loop in HTL perturbation theory for
$\pi/2$, $\pi$ and $2\pi$.
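
The renormalization step of Eq. (2) amounts to a simple rescaling, sketched below with the fitted $A$ and $B$ quoted above. The sketch assumes that the second term in the exponent is quartic in $g$ and uses the standard relation $g^2 = 2N_c/\beta$ for the Wilson action; the function names are illustrative only.

```python
import math

A = 0.0637  # fitted value quoted in the text
B = 0.0731  # fitted value quoted in the text


def g2_from_beta(beta, nc=2):
    """Bare coupling squared from beta = 2 * N_c / g^2 (Wilson action)."""
    return 2.0 * nc / beta


def z_factor(g2):
    """Z(g^2) = exp(A g^2 + B g^4); assumes the second term is quartic in g."""
    return math.exp(A * g2 + B * g2 * g2)


def renormalized_polyakov_loop(bare_loop, beta, nt):
    """Rescale the bare loop by Z(g^2)^N_t, as in Eq. (2)."""
    return z_factor(g2_from_beta(beta)) ** nt * bare_loop


if __name__ == "__main__":
    # Hypothetical input: a bare loop value of 0.1 at beta = 4.0, N_t = 4.
    print(renormalized_polyakov_loop(0.1, beta=4.0, nt=4))
```

Since the fit is quoted for $g^2 < 1.3$, the sketch should only be trusted for $\beta$ above roughly $3.08$ in SU(2).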

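The colour averaged free energy is defined through the correlation between two
Polyakov loops,

$$e^{-F_{\mathrm{avg}}(r,T)/T + C} = \frac{1}{4} \left\langle \mathrm{Tr}\,L(\mathbf{y})\, \mathrm{Tr}\,L^{\dagger}(\mathbf{x}) \right\rangle \qquad (3)$$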

which is gauge invariant. To eliminate the trivial temperature dependence
due to the colour trace normalization, we apply
$F_{\mathrm{avg}}(r,T) \to F_{\mathrm{avg}}(r,T) - T \ln 4$. Fitting the $F_{\mathrm{avg}}(r,T)$ data in Fig. 3a with
$F_{\mathrm{avg}}(r,T) = a_0(T) - \ldots + \sigma(T)\,r$, we show in Fig. 3b the results for
$\sigma(T)$ as a function of the temperature. Although the string tension in
SU(2) was already addressed in [5], the number of data points there is too
low to give a clear overview. We fit our results with two different ansätze,
$a\sqrt{1 - b\,(T/T_c)^2}$ and $a\,(T_c - T)^{\nu}\,[1 + \ldots\,T]$, and obtain a reasonable
$\chi^2/\mathrm{dof}$ for both fits. For the first ansatz, we obtain
$a = 0.6976 \pm 0.0176$, $b = 0.9990 \pm 0.0059$ and $\chi^2/\mathrm{dof} = 0.732$. In the second,
we fix $\nu = 0.63$, according to the 3D Ising exponent for the correlation
length, and obtain $a = 1.5541 \pm 0.0435$, $b = -0.5122 \pm 0.0576$ and
$\chi^2/\mathrm{dof} = 0.598$. Nevertheless, we need more data for $T < 0.7\,T_c$.

Figure 3: (a) SU(2) colour averaged free energy, $F_{\mathrm{avg}}$. (b) SU(2) string
tension, $\sigma(T)$.



3. Conclusions 

With two NVIDIA GTX 480 cards, we obtained more than 200 times
the performance of one CPU core in single precision. It is not possible
to generate SU(2) configurations using only the GPU shared memory, due
to the limited amount of shared memory available. The limited number
of registers also affects the GPU performance. Using texture memory in
this problem, we were able to achieve high performance both on the GPU
without cache memory and on the GPUs with cache memory. However, on
the GPUs with cache memory the difference is bigger in double precision
than in single precision. The occupancy and performance of the GPUs are
strongly connected to the number of threads per block, the registers per
thread, the shared memory per block and the memory access (read and write)
patterns. To maximize performance it is necessary to ensure that the memory
access is coalesced and to minimize copies between GPU and CPU memories.
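
The coalescing requirement can be illustrated by comparing the addresses touched by consecutive threads under two storage layouts. This is only an index-arithmetic sketch: the four-real-number storage of an SU(2) link and both layouts are assumptions of this example, and the source does not state which layout the code uses.

```python
from functools import partial

COMPONENTS = 4  # assumption: one SU(2) link is stored as four real numbers


def aos_index(site, comp):
    """Array of structures: the four components of a link are adjacent."""
    return site * COMPONENTS + comp


def soa_index(site, comp, nsites):
    """Structure of arrays: one contiguous array per component."""
    return comp * nsites + site


def warp_addresses(index_fn, first_site, comp, warp_size=32):
    """Element offsets read by the threads of one warp, one site per thread."""
    return [index_fn(first_site + lane, comp) for lane in range(warp_size)]


def stride(addresses):
    """Common step between consecutive offsets, or None if it is not uniform."""
    steps = {b - a for a, b in zip(addresses, addresses[1:])}
    return steps.pop() if len(steps) == 1 else None


if __name__ == "__main__":
    nsites = 16 ** 4
    print("AoS stride:", stride(warp_addresses(aos_index, 0, 0)))
    print("SoA stride:", stride(warp_addresses(partial(soa_index, nsites=nsites), 0, 0)))
```

With one array per component, neighbouring threads read neighbouring elements, which is the access pattern that the hardware can combine into few memory transactions.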

The renormalized Polyakov loop for $N_s > 4$ shows very small dependence
on the lattice time direction for $N_t = 4$ and low $T$. The string tension as
a function of the temperature, $\sigma(T)$, extracted from the colour averaged
free energy for two different spatial lattice sizes, does not reveal any volume
dependence. The string tension is zero for $T > T_c$, whereas for $T < T_c$ it is
temperature dependent. We fit the string tension with two different ansätze;
however, we need more data for $T < 0.7\,T_c$. Future work will be dedicated
to the study of this case.